This lab will serve as your initial introduction to the R programming language and RStudio development environment. We will work on this over the course of the week.
For new users, R can be an intimidating programming language to learn, especially compared to other popular data analysis tools:
At the same time, there are some really good reasons to gain familiarity with R and RStudio, particularly within the context of urban planning analysis.
Thinking about the type of analysis we will do in this class, there are some additional rationales for learning and working in R and RStudio:
With the basic rationale laid out, let’s start exploring the logic behind the R language:
By the end of this lab, you should be familiar with:
As a programming language, R was initially designed to be run in a terminal console. You can still run R in this fashion, if you wish. RStudio is an integrated development environment for R - in addition to providing us with a terminal window in which we could run commands, it also provides additional windows for viewing data and visualizations.
Here is one way to think of the relationship between R and RStudio"
If you grew up in the 1980s (as your humble professor did), you might consider the DeLorean to be a fairly cool car. This is R - it can do cool things, especially if you know how to use it well. Modify the DeLorean with lots of extra features (including time travel) and you have RStudio - add ons that build upon the strengths of R (so far as we know, RStudio hasn’t been used for time travel yet).
It is important that we keep these things in mind, because they will help us to understand what its like to interact with R via RStudio. Let’s start with a very basic way of interacting with R. Here is what the basic RStudio interface looks like:
We are working in a special kind of document called an R Notebook. Our R Notebook allows us to integrate plain text with chunks of code.
The benefit of working in a notebook is that you can run code in line with your text, and see the results integrated with your writing kind of like a scientific lab notebook. Some people will use R Notebooks to write reports, since they can render tables and figures in line with their text, and these can easily be updated if new data or parameters are supplied.
This is what a code chunk looks like:
```r
```r
# Code Chunk
Any content inside of this code chunk will be interpreted by your R session in the terminal when you hit the green play button to the right. You can also step through each line of code by putting your cursor to the right of it and hitting command+enter (Mac) or control+enter (PC) - I strongly recommend you get into the habit of running code this way at first.
You can create new code chunks by pressing control+option+I (Mac) or control+alt+I (PC).
In the above code chunk, we have some text preceded by a hashtag (#). Any content to the right of a hashtag (# groundbreaking insightful comment) will be considered a comment and will not be interpreted as code. Comments are a great way to make short notes to remind yourself or others of what you’re doing:
```r
```r
x
Let’s start by entering a simple command - let’s add together 2 and 2 and ask R to return the product.
2+2
[1] 4
Entering 2+2 into our console window and then hitting command/control+enter asks R to process the request we have given it - it then gives us back an answer to our request. We could do the same thing with other simple numeric operators:
+ Addition- Subtraction* Multiplication/ Division^ Exponents (e.g. 2^3 = 8)() Parentheses - to control order of operations (e.g. ```(2+3)/5 = 1)We can do basic math in a console - not terribly exciting, but at least this helps you to see how R will respond to basic commands:
```r
```r
col2
Now your turn - create a code chunk in line here (remember, control + option + I), and perform some simple math operations. Also explore how R handles order of operations.
3^4
[1] 81
In most cases, we don’t want to just type things into the console and then get an answer - we’d be just as well served with a calculator. Our next step is to understand that R can store the output of a command for later use. The most basic way to do this is to assign our output to an object. we can do this using the <- assignment operator:
```r
```r
str(ozs)
Let’s learn how to speak this out. We just told R, into an object we have (arbitrarily) named “x”, store the output of 2+2. Because this is now stored, we can retrieve it and use it later. If we simply ask for “x” R will share with us the previously assigned output - 2+2
Option+- (Mac) or Alt + - (PC) is the shortcut for inserting the assignment operator.
```r
# Your Work Here
<!-- rnb-source-end -->
<!-- rnb-chunk-end -->
<!-- rnb-text-begin -->
This means that we could also use this output in other formulas. Let's see what happens if we square X:
<!-- rnb-text-end -->
<!-- rnb-chunk-begin -->
<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxueF4yXG5cbmBgYCJ9 -->
```r
x^2
[1] 16
Since X is 4, we get the output that is the equivalent of typing 4^2.
It is important to note here that we can provide any type of label we’d like for an R variable. Instead of using “x” as a variable name, we could use anything else.
Assign the sum of 4+6 into a variable named “cat”.
We just assigned to a variable called “cat” the product of 4+6. To retrieve the value of your assigned variable, you can just call it by name:
cat
R allows you to name variables as you wish. Note that variables need to start with a character, and cannot start with a number (e.g. 1_Numbers would not work). Also note that you will want to avoid variable names that are the same as R functions (so naming a variable “mean” for instance, would not be a good idea, as this would cause confusion with the function mean() which calculates the average of a vector).
We can of course work with multiple variables at once:
cat+x
[1] 14
In this case, cat is 10 and x is 4 (you can see the values stored in objects in the environment pane). Let’s divide cat+x by x:
(cat+x)/x
[1] 3.5
This is great - we have a calculator that can store and make use of values as objects. Not so exciting for neighborhood analysis just yet, though…
The next thing to note is that objects don’t have to be single values. We could also assign lists of values to an object:
Note here that c() (formally the concatenate function) is used to denote that we have a list. Each list item it separated by commas. If we call up this object, we can have a look at our list:
col1
Working with a single object, we could do things by using the object in a formula. We can do the same with a list:
col1+2
[1] 4 5 6 7 8
To each list item, we added 2. We could even store this as a new object if we wanted 2
Writing this out, we told r “Place into an object called”col2" the product of adding 2 to each item in the sequence contained in “col1”.
col2
Cool! We can manipulate our list items all at once.
We can also have R automatically create sequences of numbers for us, if they follow a regular pattern using the seq() command:
seq(0,100, 5)
[1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
This says create a list containing the sequence of numbers from 0 to 100 counting by fives.
Try creating your own sequence - count up from 4 to 24 by 4
seq(4,24, 4)
[1] 4 8 12 16 20 24
Try creating your another sequence - count down from 50 to 2 by 4. How would you do this?
seq(50,2, -4)
[1] 50 46 42 38 34 30 26 22 18 14 10 6 2
But we digress - back to our existing lists. What would happen if we decided to multiply col1 by col2?
col1*col2
[1] 8 15 24 35 48
Can you see what happens here? Since our lists are the same size, R multiples the first item in col1 by the first item in col2, the second item in col1 by the second item in col2, and so on - e.g. (24, 35 …)
Lists, however, don’t have to be just numeric - they can be other types of things as well:
c("Black Cat", "Brown Dog", "Dappled Donkey", "Red Rooster") is a character vector containing four items.We’ll talk about some other types of vectors later, but these are sufficient to get you started. In addition to numeric vectors, probably the most common other type of vectors we will encounter are character vectors. Let’s make a list of the items we need to make a single serving of oatmeal (your professor is hungry as he writes this lab):
c("Oatmeal", "Water", "Salt", "Sugar")
In a separate list, let’s place the quantity of ingredients:
c(1/2, 1, .25, 1)
Perhaps it would be useful to make a third list with the unit of measure for the quantity of ingredients:
c("Cup", "Cup", "Teaspoon", "Tablespoon")
Okay, we have three lists, that we might be able to use for different things. Write code that assigns the list of ingredients to a new object called “ingredients”, write the quantities to a new object called “quantity”, and write the units to a new object called “units”.
We could do some interesting things here, like paste together the different list items into something approximating a recipe:
paste(quantity, units, ingredients, sep=" ")
[1] "0.5 Cup Oatmeal" "1 Cup Water" "0.25 Teaspoon Salt"
[4] "1 Tablespoon Sugar"
What is sep = " " doing here? What would happen if you changed sep to a comma?
You can take a look at the documentation for the paste() command by typing ?paste.
Verbalizing what we just asked R to do (a valuable habit for problem solving more complex functions and data manipulation later on), we said “paste together the list items contained in the variables quantity, units, and ingredients, placing a space between each of the list items.”
We can also manipulate list objects - let’s say you volunteer to host a community meeting and need to make 45 portions of your oatmeal recipe - how would you go about constructing your grocery list? Below, write out the operations that you would need to do to modify your existing list of quantities to account for 45 portions (let’s assume that ingredient quantities remain the same when we scale up our recipe):
Since you’re new at this, here are few ways to do this (I hope you’ve tried on your own to figure it out on your own before reading on) - you could either modify the quantities in the quantity vector by multiplying them directly and creating a new vector:
quantity45 <- quantity*45
paste(quantity45, units, ingredients, sep=" ")
Alternatively, you could modify the list directly in your paste command:
paste(quantity*45, units, ingredients, sep=" ")
[1] "22.5 Cup Oatmeal" "45 Cup Water" "11.25 Teaspoon Salt"
[4] "45 Tablespoon Sugar"
The outputs are exactly the same.
R can also help us to pick out list items. The brackets [] allow us to return list items by position (left to right).
ex_list[4]
[1] "Robert"
What would happen if we put [-4] instead of 4?
ex_list[-4]
[1] "Jane" "Jacobs" "beats" "Moses" "in" "a" "fight"
[8] "for" "New York"
Now you try selecting the tenth element from the list.
ex_list[10]
[1] "New York"
We can also select multiple elements at the same time:
ex_list[c(4, 5, 3, 1, 2)]
[1] "Robert" "Moses" "beats" "Jane" "Jacobs"
We created a list c() and placed it in brackets which told R that we wanted to return the values of ex_list that corresponded to the positions in our other list c(4, 5, 3, 1, 2).
Now you try creating and manipulating a list.
ex_list[c(4, 8, 9, 10)]
Now re-create the sequence you crafted earlier (count up from 4 to 24 by 4) and subset out the fifth element from that numeric sequence:
list[5]
[1] 20
Fun (maybe), but not yet particularly useful. You didn’t take this class because you wanted to scale oatmeal recipes or identify numbers in a sequence. The more powerful stuff is coming up! - these are building blocks to teach some of the logic of the language.
The next thing for us to think about is how we might combine lists together. Thinking back to our oatmeal recipe, right now we have three separate lists each (respectively) with our quantity, units, and ingredients. We’ve figured out that we can paste together items from different lists, but it might be nice to be able to store them in one object rather than three. This is a good time to introduce the R data frame, which is the object you’ll be dealing with the most.
Start by looking at R’s internal documentation on data frames (?data.frame):
:
?data.frame
Now lets coerce our three lists into a single data frame called “oatmeal”:
We created a new object called “oatmeal” that has bound our three lists together into a data frame. We need to specify stringsAsFactors = FALSE to keep R from turning our strings (characters) into a special data type called factors (more on these later). R assumes that we want our columns labeled with the original list object names.
We can now look at our list as a series of columns that have been given the name of the variable they were stored in as a list, and each row represents one of the list items. An important concept to keep in mind is that a data frame is a series of lists that are in essence glued together.
We can refer to and access rows and columns in our data frame in several ways. If we want to return those items in a specific column if the list, we can use the $ operator to refer to that item:
oatmeal$ingredients
We just returned the list items that were in the column named ingredients. We could also use our subset notation to retrieve the same things. This subset notation builds upon what we learned when we accessed list items by position (e.g. units[2]):
knitr::include_graphics("Images/04_guru99_dataframe.png")
While subsets of lists require one number corresponding to the index position, data frames have two dimensions - rows and columns, so we need to be able to differentiate between each. R does this using a comma in the subset notation [row, column]. If we want all rows or columns, we can just leave that portion of the bracket empty. For instance, the code below is the equivalent of typing oatmeal$ingredients since ingredients are the third column in the oatmeal data frame:
oatmeal[,3]
[1] "Oatmeal" "Water" "Salt" "Sugar"
This is because ingredients is the third column in the oatmeal data frame.
Now you try: query the second row of the oatmeal data frame:
How would you query the third row of the second column?
Ok, so we see that we can create subsets fairly easily, either using column names or index positions in our dataset.
We can add new columns to our data frame. Oftentimes when we are working with data, we’ll need to calculate a new column based upon the values in other columns. We have our data frame with a recipe for 1 serving of oatmeal. Let’s say we frequently need to make 45 servings of oatmeal (its famous, and the reason why people show up to your 7am neighborhood meetings…), so you want to include that quantity alongside the single serving quantity.
Let’s create a new column called “quantity45” and add to it the quantity of ingredients for a 45 serving batch of oatmeal:
Note that we need to refer to the original quantity by pointing to the oatmeal data frame as well. Let’s verbalize this to think about what we’re doing. “Into a new column in the oatmeal data frame called quantity45 (oatmeal$quantity45 <-), write the value contained in the oatmeal data frame called quantity multiplied by 45 (oatmeal$quantity*45).”
Now you try - create a new column called “instructions” in the oatmeal data table that contains our recipe quantity for 45 portions of oatmeal, units, and ingredients pasted together (this will require you reference data using concepts we learned earlier):
oatmeal$instructions <- paste(quantity45, units, ingredients)
Error in paste(quantity45, units, ingredients) :
object 'quantity45' not found
Note again that we need to refer to each of the specific columns in the oatmeal data frame using their appropriate vector (e.g. oatmeal$ingredients. Note that in this case if we omitted the pointer to oatmeal, R would assume we wanted to do something with the list called ingredients. In this case, that would actually work, but in most cases, we will just have a data frame and won’t have a separate list stored as an R object - we’d get an error.
We can also pull out all items meeting a specific criteria in our data frame - let’s say we want to look at those ingredients that are measured in cups:
This looks weird - we’ve introduced some new notation here. Let’s first speak this out and then we can learn more about the notation. “From the oatmeal data frame, return those rows from within the oatmeal data frame for which the value of the units column is equal to”Cup" (the equal sign in R is == two equal signs together).Notice also that the word “Cup” has parentheses surrounding it, denoting that it is a character string. The square brackets [] denote that we’re looking for something (or some things) within the oatmeal data frame.
While in this case, we’re looking for rows that meet a specific criteria based upon the word “Cup” (searching for a character string), we could return subsets of numeric records in other ways:
Returns those records from the oatmeal data frame for which the quantity value is greater than .5. If we wanted to include our .5 cups of water, we could specify >= (greater than or equal to).
Let’s stop playing around with hypothetical oatmeal recipes, and get our hands on some “real world” data!
We’re going to work with data on those census tracts that were designated as Opportunity Zones as part of the federal Tax Cuts and Jobs Act. These incentives are designed to spur investment in low-income and under-capitalized cities, by providing investors with tax incentives to invest capital in these locations.
We’re going to practice working with data frames and tibbles by using data census tracts that are designated as Opportunity Zones by the US Treasury. The 2017 Tax Cuts and Jobs Act included a new federal incentive to spur investment in low-income and undercapitalized communities. Each state had the opportunity to designate specific census tracts as Opportunity Zones. Practitioners and researchers have many questions about the efficacy of the program and the designations made by governors.
Take a look here to see a map of where opportunity zones are located. The pink geometries reflected on the map are census tracts, which we often use as a proxy for neighborhoods, especially in urban areas. Find a place you know, and take a look at which areas are designated.
A copy of the Urban Institute’s dataset (which contains the underlying data for this map as well as some other analysis) is downloaded and is in the “data” folder as part of this lab’s project. We could have R download it directly, but that’s for another class session!
We’re been working primarily in “base” R as we are getting familiar with the R language and RStudio interface. To import data in an Excel file, we’ll introduce tools that extend the functionality of the base R syntax.
Let’s start by loading a package which will help us to work through loading data. Specifically, we’ll load the readxl package which contains tools which will help us to read Excel files into R.
We need to first install the readxl package - this is a one-time operation. When we install a new package, we ask R to download it from the internet and add it to our library of packages. We use install.packages() to install a package.
After a package is installed, we need to load it so that it is available to use in our current R session. We need to do this once per session. To stay organized, I typically include code to install all of the packages that I’ll need at the very beginning of my script. That way when I open and run a script, those packages will be ready to go. We use the library() command to load packages.
Go ahead and put your new knowledge to use. Use install.packages() to download the readxl packages and then use library() to load it into for your R session:
install.packages("readxl")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.1/readxl_1.3.1.tgz'
Content type 'application/x-gzip' length 1678740 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
The downloaded binary packages are in
/var/folders/1v/yd1ly0pj7891nkn5syts2ny40000gn/T//RtmpLaIVtl/downloaded_packages
Note that to install the package, you need to treat the package name as a character vector "readxl", but when you load it in your R session, it does not need to be treated as a character vectorreadxl.
Now that we’ve loaded the packages that we need to work with, we can import the excel file containing data on tracts designated as opportunity zones in the United States.
We downloaded the readxl package, which is designed to read Excel files into R data frames. You can either do a Google search for Readxl to find documentation, or you can use R’s built in documentation by typing ?readxl
?Readxl
No documentation for ‘Readxl’ in specified packages and libraries:
you could try ‘??Readxl’
As the documentation states, readxl imports excel files. Looking at the documentation, the read_excel() command will read a single excel sheet, or we can optionally select a sheet by name or number from an excel workbook with multiple sheets. In this case, the Urban Institute data is in a workbook with a single sheet, so we just need to tell R where the file is to load.
The dataset we want to load is called “urbaninstitute_tractlevelozanalysis_update1242018.xlsx” (oof! - that’s a descriptive but way too long file name!). We can point R to the correct location. Since our R project file sets the relative path for all of the work within, the path to the data is:"data/urbaninstitute_tractlevelozanalysis_update1242018.xlsx". Wrapped into the command to read the excel file, it looks like this:
read_excel("data/urbaninstitute_tractlevelozanalysis_update1242018.xlsx")
R read the data and is displaying it to us. Thats well and good, but let’s save the data table as an object.
Read the Excel data into an object called “ozs”:
ozs <- data.frame[read_excel("data/urbaninstitute_tractlevelozanalysis_update1242018.xlsx")]
Error in data.frame[read_excel("data/urbaninstitute_tractlevelozanalysis_update1242018.xlsx")] :
object of type 'closure' is not subsettable
Look at your Environment window (top right quadrant of RStudio) - a data frame containing information on opportunity zones should be loaded in an object called “ozs”.
The environment window tells us that the object ozs contains 42,176 observations (rows) and 27 variables (columns).
Inspect the dataset using the View() command. This will allow us to look at the data in a tabular format.
Now, use the str() (structure) command to gain a better understanding of the types of data in each column
str(ozs)
tibble [42,176 × 27] (S3: tbl_df/tbl/data.frame)
$ geoid : chr [1:42176] "01001020200" "01001020300" "01001020700" "01001020802" ...
$ state : chr [1:42176] "Alabama" "Alabama" "Alabama" "Alabama" ...
$ Designated : num [1:42176] NA NA 1 NA NA NA NA 1 NA 1 ...
$ county : chr [1:42176] "Autauga County" "Autauga County" "Autauga County" "Autauga County" ...
$ Type : chr [1:42176] "Low-Income Community" "Non-LIC Contiguous" "Low-Income Community" "Non-LIC Contiguous" ...
$ dec_score : num [1:42176] 4 6 9 10 5 6 6 9 10 9 ...
$ SE_Flag : num [1:42176] NA NA NA NA NA NA NA NA NA NA ...
$ Population : num [1:42176] 2196 3136 3047 10743 2899 ...
$ medhhincome2014_tract: num [1:42176] 41107 51250 45234 61242 49567 ...
$ PovertyRate : num [1:42176] 0.24 0.107 0.19 0.153 0.151 ...
$ unemprate : num [1:42176] 0.0775 0.051 0.1407 0.0459 0.0289 ...
$ medvalue : num [1:42176] 95300 113800 93500 160400 102900 ...
$ medrent : num [1:42176] 743 817 695 1018 546 ...
$ pctown : num [1:42176] 0.628 0.703 0.711 0.823 0.83 ...
$ severerentburden : num [1:42176] 0.3269 0.3223 0.3887 0.1994 0.0994 ...
$ vacancyrate : num [1:42176] 0.0584 0.1399 0.0619 0.0609 0.2182 ...
$ pctwhitealone : num [1:42176] 0.439 0.671 0.833 0.814 0.726 ...
$ pctblackalone : num [1:42176] 0.5187 0.205 0.0922 0.1572 0.2456 ...
$ pctHispanic : num [1:42176] 0.01275 0.0727 0.0338 0.01368 0.00448 ...
$ pctAAPIalone : num [1:42176] 0.01093 0.01052 0 0.00959 0 ...
$ pctunder18 : num [1:42176] 0.218 0.224 0.249 0.27 0.245 ...
$ pctover64 : num [1:42176] 0.124 0.175 0.149 0.122 0.156 ...
$ HSorlower : num [1:42176] 0.581 0.464 0.544 0.45 0.621 ...
$ BAorhigher : num [1:42176] 0.162 0.219 0.113 0.229 0.136 ...
$ Metro : num [1:42176] 1 1 1 1 1 1 1 1 1 1 ...
$ Micro : num [1:42176] NA NA NA NA NA NA NA NA NA NA ...
$ NoCBSAType : num [1:42176] NA NA NA NA NA NA NA NA NA NA ...
We get a list of the columns in the data, along with their types (in this case character or numeric), and then we see the values associated with the first few observations.
A few things to note after your preliminary inspection:
NA where the tract was not designated.R has several functions for determining the structure of data frames and tibbles. See below:
Size
dim(ozs): returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(ozs): returns the number of rowsncol(ozs): returns the number of columnsContent
head(ozs): shows the first 6 rowstail(ozs): shows the last 6 rowsNames
names(ozs): returns the column namesSummary
str(ozs): structure of the object and information about the class, length and content of each columnsummary(ozs): summary statistics for each columnTry your hand at some of these summarization methods to see what they produce:
summary(ozs)
geoid state Designated county
Length:42176 Length:42176 Min. :1 Length:42176
Class :character Class :character 1st Qu.:1 Class :character
Mode :character Mode :character Median :1 Mode :character
Mean :1
3rd Qu.:1
Max. :1
NA's :33414
Type dec_score SE_Flag Population
Length:42176 Min. : 1.000 Min. :1 Min. : 0
Class :character 1st Qu.: 3.000 1st Qu.:1 1st Qu.: 2752
Mode :character Median : 5.000 Median :1 Median : 3897
Mean : 5.495 Mean :1 Mean : 4147
3rd Qu.: 8.000 3rd Qu.:1 3rd Qu.: 5224
Max. :10.000 Max. :1 Max. :40616
NA's :1253 NA's :41111 NA's :112
medhhincome2014_tract PovertyRate unemprate medvalue
Min. : 2499 Min. :0.0000 Min. :0.00000 Min. : 9999
1st Qu.: 32014 1st Qu.:0.1380 1st Qu.:0.05900 1st Qu.: 85700
Median : 41094 Median :0.2055 Median :0.08734 Median : 122400
Mean : 42153 Mean :0.2331 Mean :0.10063 Mean : 165663
3rd Qu.: 50833 3rd Qu.:0.2996 3rd Qu.:0.12600 3rd Qu.: 191300
Max. :181406 Max. :1.0000 Max. :1.00000 Max. :2000001
NA's :249 NA's :141 NA's :141 NA's :1106
medrent pctown severerentburden vacancyrate
Min. : 99.0 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.: 655.0 1st Qu.:0.3833 1st Qu.:0.1662 1st Qu.:0.07115
Median : 800.0 Median :0.5728 Median :0.2403 Median :0.11658
Mean : 860.9 Mean :0.5436 Mean :0.2476 Mean :0.14120
3rd Qu.:1010.0 3rd Qu.:0.7316 3rd Qu.:0.3206 3rd Qu.:0.18011
Max. :3501.0 Max. :1.0000 Max. :1.0000 Max. :1.00000
NA's :395 NA's :1033 NA's :189 NA's :167
pctwhitealone pctblackalone pctHispanic pctAAPIalone
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.2040 1st Qu.:0.01072 1st Qu.:0.02602 1st Qu.:0.00000
Median :0.5614 Median :0.06656 Median :0.09304 Median :0.00883
Mean :0.5211 Mean :0.18652 Mean :0.22060 Mean :0.03806
3rd Qu.:0.8294 3rd Qu.:0.25000 3rd Qu.:0.32014 3rd Qu.:0.03533
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :0.91144
NA's :131 NA's :131 NA's :131 NA's :131
pctunder18 pctover64 HSorlower BAorhigher Metro
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :1
1st Qu.:0.1908 1st Qu.:0.09436 1st Qu.:0.4150 1st Qu.:0.1120 1st Qu.:1
Median :0.2300 Median :0.13604 Median :0.5182 Median :0.1679 Median :1
Mean :0.2295 Mean :0.14340 Mean :0.5067 Mean :0.2034 Mean :1
3rd Qu.:0.2719 3rd Qu.:0.18057 3rd Qu.:0.6113 3rd Qu.:0.2536 3rd Qu.:1
Max. :0.6468 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1
NA's :131 NA's :131 NA's :132 NA's :132 NA's :9111
Micro NoCBSAType
Min. :1 Min. :1
1st Qu.:1 1st Qu.:1
Median :1 Median :1
Mean :1 Mean :1
3rd Qu.:1 3rd Qu.:1
Max. :1 Max. :1
NA's :37448 NA's :37793
Now that we have a real live dataset loaded into R, we probably want to describe what it is. We can use a function like summary() to take a closer look at some of the properties of the data:
summary(ozs)
We get a column by column summary based upon the type of data which R “sees” in each column in the ozs data. For character columns, this includes the length (number of observations) and type of data (character in this case). For numeric columns, we get some descriptive statistics, including the number of records that area NA or have no value.
How would we run summaries just for population, median household income, and poverty rate (think back to how we created subsets using lists)?
summary(ozs[c("Population", "medhhincome2014_tract", "PovertyRate")])
Population medhhincome2014_tract PovertyRate
Min. : 0 Min. : 2499 Min. :0.0000
1st Qu.: 2752 1st Qu.: 32014 1st Qu.:0.1380
Median : 3897 Median : 41094 Median :0.2055
Mean : 4147 Mean : 42153 Mean :0.2331
3rd Qu.: 5224 3rd Qu.: 50833 3rd Qu.:0.2996
Max. :40616 Max. :181406 Max. :1.0000
NA's :112 NA's :249 NA's :141
Next up, practice your querying skills - how would we return only those records for census tracts with a median household income above $100,000 per year?
How would we query out tracts in Illinois?
We can see in our table output that there are 1,682 eligible or designated tracts in Illinois. We could also use the nrow() command to count the number of rows. Try crafting code that would count the number of rows for Illinois:
nrow(ozs[ozs$state == "Illinois",])
[1] 1682
We might also want to calculate statistics like averages for subsets. mean() will calculate the mean of a list or column. What’s the average income for tracts with a vacancy rate above 20 percent? What’s the average income for tracts with a vacancy rate below 20 percent?
mean(ozs$medhhincome2014_tract[ozs$vacancyrate < .2], na.rm =TRUE)
[1] 43849.18
You might need to check out the documentation for mean() in order to return an answer here. R will not calculate the mean if an NA values are present in the vector for which you’ve requested the mean - a good safety feature if you’re expecting all values to be present. In imperfect data like what we’re dealing with, you can instruct R to remove those NAs and find the mean for remaining values. You may also want to make sure you’ve counted the number of NA values so you know what proportion of your data the mean is actually representing.
We might also be interested in combining query criteria. How would we determine the average income for tracts in Illinois with a poverty rate of greater than 20 percent?
& is equivalent to AND and | is equivalent to OR. Now give it a go!mean(ozs$medhhincome2014_tract[ozs$PovertyRate > 0.2 & ozs$state == "Illinois"], na.rm=TRUE)
[1] 33526
To confirm we got the query correct, have a look at the returned data (with out calculating the mean):
ozs$medhhincome2014_tract[ozs$PovertyRate > 0.2 & ozs$state == "Illinois"]
[1] 28819 32313 17850 26012 40475 35387 40714 22326 21500 49590 40599 37679 26676
[14] 7004 NA 7273 5736 38056 38083 33873 49597 22813 30994 49303 48125 46364
[27] 7234 22688 44800 35904 21107 37823 36164 44460 30406 30682 22945 15020 38922
[40] 39000 29861 38861 29432 26750 29870 17569 36464 51458 64073 39349 40913 45349
[53] 51840 48450 40272 36848 39096 35463 43681 29091 27931 34671 35286 27018 36412
[66] 22647 23700 42316 38571 31696 43895 NA 42077 36159 42280 45417 51250 52074
[79] 47652 31364 42762 42232 36522 44750 46905 39926 37284 36607 36644 48316 40662
[92] 38693 49236 47000 44393 43912 55741 34375 43607 34340 45208 47344 52063 28558
[105] 52833 43717 56250 30503 32317 34933 36338 32232 26773 27788 51542 35698 40515
[118] 43650 69348 52096 28487 37227 31004 33873 31160 35625 34583 23516 43011 28631
[131] 28363 22708 34030 28311 20405 20620 22289 23264 33508 31029 24914 12036 25921
[144] 29306 22097 28867 21607 23214 20912 20563 30817 16989 23929 22150 32717 31181
[157] 17652 21250 20278 20565 58047 45625 71250 61389 23834 21516 25291 27019 36250
[170] 38054 40511 29120 29446 25099 26234 26375 31216 30205 37013 35764 33088 29914
[183] 45096 48083 29980 40719 34922 31875 33173 20990 11964 41645 11250 42324 11310
[196] 10942 23603 24500 19747 18063 21250 26000 34327 NA NA 32500 31602 18808
[209] 24178 20250 31938 25346 18859 21250 51875 35636 38260 52817 82667 19034 37610
[222] 29960 14500 22353 17731 20573 29688 21658 33849 20889 23606 15723 18125 19872
[235] 28214 19688 42292 32500 21362 22407 21727 22306 22619 33125 28984 22371 33281
[248] 31008 27353 29125 20944 43341 31518 20118 13536 23684 36938 41917 44514 51723
[261] 38750 36631 32430 35469 43083 26641 39276 26630 26443 39348 33304 37679 32277
[274] 34154 33750 35240 38857 24604 27073 42684 39500 41625 35089 25956 14556 14273
[287] 52485 44706 40199 39722 40743 40254 41081 38452 31688 49358 43698 32682 32377
[300] 31982 33631 21678 20761 22969 26727 26289 32949 27622 31250 26985 27880 25609
[313] 25855 23906 45000 32331 33509 40054 52879 40901 22422 32083 39942 27712 27156
[326] 35338 25846 31329 45313 35647 24408 22381 25786 27589 20119 24911 22000 34583
[339] 27266 19205 33947 26094 25133 24427 36042 42670 22844 27841 15485 15139 16862
[352] 22163 22972 35476 18875 26070 29844 34609 20431 28472 36818 25461 22250 14622
[365] 26180 24806 32961 27864 24045 22250 19561 31330 39005 32788 35573 35991 31867
[378] 51681 47656 50913 40125 43872 57692 61991 50726 49116 49518 48011 65357 55523
[391] 56975 41051 51699 61806 49268 30532 46988 62132 35500 44678 32027 40268 43021
[404] 44778 34835 35810 39139 40308 40201 38163 36250 47943 46172 39212 40083 39965
[417] 28508 22261 45901 44083 37941 40099 37885 39534 39766 46728 41196 49572 46739
[430] 27861 37374 32196 37334 44048 36693 37500 41449 44653 31683 41405 30769 31474
[443] 37281 17425 22857 23048 22650 41250 20500 17633 23761 30487 36613 37659 43529
[456] 39593 27214 14722 22242 44722 22956 31188 35577 33220 44592 46533 99375 21456
[469] 41902 42786 40286 35179 32500 34231 58500 50521 51442 23750 37179 31238 30161
[482] 38883 32204 20898 26071 21827 19054 26514 37723 19207 23667 NA 55813 18627
[495] 52717 25699 52308 30946 28806 43203 26319 13429 26023 27283 33250 38750 56908
[508] 17377 17370 38382 43516 13281 20294 19228 84780 29831 22279 42200 33942 45694
[521] 37891 31389 30759 50417 27185 32315 43409 25694 19000 22446 18582 32895 35528
[534] 39652 9485 35041 21053 17197 28000 42254 37353 24886 35875 31788 38060 27012
[547] 43780 74038 23302 16844 15242 54444 52368 38551 39306 34028 34375 41646 37772
[560] 39049 43156 36469 24922 28448 31907 38279 30656 35469 54293 43860 31845 42463
[573] 34167 46731 24886 29831 37917 10127 13101 15671 22108 33514 18411 28791 15000
[586] 47733 54483 44872 35329 40491 47273 36543 47908 49904 44732 49091 47981 38281
[599] 46639 49101 36830 26627 30705 26406 24423 21488 42832 23569 36806 41053 17938
[612] 22813 33579 23011 34970 38380 36824 60630 38188 36127 48339 22500 29408 37883
[625] 51121 22312 30547 34063 32134 52981 37361 45333 46020 32725 26861 45385 38625
[638] 35123 33250 41824 21739 NA 37443 33056 47813 55861 11053 37266 62966 37356
[651] 26224 40188 42537 51354 29479 26250 31250 40556 21613 26103 26667 14434 17577
[664] 28866 32083 48706 35191 39013 41739 23222 26432 25377 32083 55313 16010 45078
[677] 34382 22121 21701 41958 37261 44291 36618 40408 39514 20000 39792 30735 74554
[690] 36917 25781 37875 44094 29304 28125 24932 35179 32198 33144 31446 38144 30464
[703] 34314 46034 12474 29773 22121 21582 26591 12247 16337 18815 27599 27225 29762
[716] 44844 67000 31653 35291 41462 45929 20942 31635 42583 32396 28884 36902 24671
[729] 40298 38750 29677 36979 41037 32434 42917 42344 20741 36960 21969 25463 9836
[742] 17120 15394 28065 21849 24095 29875 41964 37750 28077 37357 34449 19650 27019
[755] 29250 33240 25134 29779 24486 9967 15036 27005 16353 20439 35398 28750 63750
[768] 36333 32188 38750 41886 30930 25994 38480 28294 20500 16667 31716 16750 16513
[781] 17500 24821 41966 31002 23214 32956 54387 32697 32260 52788 52768 25078 23371
[794] 29978 40506 19145 27165 32034 23514 43007 31639 33981 40532 39006 34708 37185
[807] 31250 37414 51767 31772 34732 31674 37386 27448 19518 22899 50052 40212 37792
[820] 40676 45962 35028 56190 38083 36794 30932 46587 27457 18983 14731 23594 34420
[833] 36053 27566 28004 14853 40737 20000 27397 19167 25791 26979 12816 21091 21330
[846] 27432 31477 34939 44792 35660 NA NA NA NA NA NA NA NA
[859] NA NA NA NA NA NA NA NA NA NA NA NA NA
[872] NA NA
You’ll note that there are several flag variables in the data for which values are either 1 or NA. We have flags for whether a tract was designated as an Opportunity Zone, Whether it is located in a Metropolitan, Micropolitan, or Non-Core-Based Statistical Area.
We can use logical tests in R to identify those values that are NA. We can use is.na() to test whether a value is NA (TRUE) or is not NA (FALSE). We could also use the negation sign ! to determine whether a value is not NA (!is.na()).
The following code returns a vector of logical values (TRUE / FALSE) regarding whether the value for the Designated column is NA or not.
is.na(ozs$Designated)
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
[14] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
[27] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[40] TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[53] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[66] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[79] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[92] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[105] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
[118] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
[131] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
[144] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[157] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
[170] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[183] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[209] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
[222] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[235] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[248] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[261] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[274] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[287] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[300] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[313] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[326] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
[339] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[352] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[365] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[378] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[391] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
[404] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
[417] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[430] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[443] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[456] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[469] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
[482] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[495] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
[508] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[521] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[534] FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[547] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[560] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[573] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[586] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[599] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[612] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[625] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[638] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[651] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[664] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[677] FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[690] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[703] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
[716] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
[729] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[742] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[755] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[768] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
[781] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
[794] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
[807] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[820] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[833] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
[846] TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[859] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
[872] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
[885] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
[898] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[911] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
[924] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[937] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[950] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[963] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
[976] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[989] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
[ reached getOption("max.print") -- omitted 41176 entries ]
For logical values, R codes 1 as TRUE and 0 as false, meaning if we wanted to count the number of undesginated tracts, we could simply ask for the sum of the values for which the logical test is true (the sum of the values that are NA):
sum(is.na(ozs$Designated))
[1] 33414
33,414 tracts were not designated.
Now count the number of tracts that were designated (where the value is not NA):
sum(!is.na(ozs$Designated))
[1] 8762
We might also want to recode those NA values to something else. We can use assignment and subset notation to replace na values with something else. Let’s replace those NAs in the Designated column with 0.
Can you inspect the table here to see what happened? In plain language, we told R “for those values of the column named Designated in the ozs data table where the values are NA, assign a new value of 0.”
Go ahead and do the same thing for the Metro, Micro, and NoCBSAType columns:
Now answer the following questions:
mean(ozs$PovertyRate[ozs$Designated ==1 & ozs$NoCBSAType ==1], na.rm=TRUE)
[1] 0.2357986
mean(ozs$vacancyrate[ozs$state == "Illinois" & ozs$Designated == 1], na.rm=TRUE)
[1] 0.1840626
mean(ozs$vacancyrate[ozs$state == "Illinois" & ozs$Designated == 0], na.rm=TRUE)
[1] 0.1210847
How many census tracts have populations over 10,000? What is the mean median household income for census tracts with populations over 10,000?
mean(ozs$medhhincome2014_tract[ozs$Population >= 10000], na.rm=TRUE)
[1] 51901.47
Congratulations! You are now well versed in the following: - The RStudio console and R language - R data types and structures - Basic data manipulation and querying
Welcome to the nerd zone, my friends…
Moving forward, we’ll explore some more advanced strategies for summarizing data.